A character property is a named attribute of a character that specifies how the character behaves and how it should be handled during text processing and display. Thus, character properties are an important part of specifying the character’s semantics.
On the whole, Emacs follows the Unicode Standard in its implementation of character properties. In particular, Emacs supports the Unicode Character Property Model, and the Emacs character property database is derived from the Unicode Character Database (UCD). See the Character Properties chapter of the Unicode Standard for a detailed description of Unicode character properties and their meaning. This section assumes you are already familiar with that chapter of the Unicode Standard and want to apply that knowledge to Emacs Lisp programs.
In Emacs, each property has a name, which is a symbol, and a
set of possible values, whose types depend on the property; if a
character does not have a certain property, the value is
nil. As a general rule, the names of character
properties in Emacs are produced from the corresponding Unicode
properties by downcasing them and replacing each
‘_’ character with a dash
‘-’. For example,
Canonical_Combining_Class becomes
canonical-combining-class. However, sometimes we
shorten the names to make their use easier.
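The general naming rule above can be sketched in Emacs Lisp. The helper name my-unicode-property-name is hypothetical and not part of Emacs, and the sketch covers only the general rule, not the cases where Emacs shortens a name.

```elisp
;; Hypothetical helper, for illustration only: apply the general rule
;; (downcase, then replace each `_' with `-') to a Unicode property name.
(defun my-unicode-property-name (unicode-name)
  "Return the symbol the general rule derives from UNICODE-NAME, a string."
  (intern (subst-char-in-string ?_ ?- (downcase unicode-name))))

(my-unicode-property-name "Canonical_Combining_Class")
;; ⇒ canonical-combining-class
```

Because some Emacs property names are shortened, the result should be checked against the list of properties in this section before it is used.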
Some codepoints are left unassigned by the UCD—they don’t correspond to any character. The Unicode Standard defines default values of properties for such codepoints; they are mentioned below for each property.
Here is the full list of value types for all the character properties that Emacs knows about:
name
Corresponds to the Name Unicode property. The
value is a string consisting of upper-case Latin letters A to
Z, digits, spaces, and hyphen ‘-’
characters. For unassigned codepoints, the value is
nil.
general-category
Corresponds to the General_Category Unicode
property. The value is a symbol whose name is a 2-letter
abbreviation of the character’s classification. For
unassigned codepoints, the value is Cn.
canonical-combining-class
Corresponds to the Canonical_Combining_Class
Unicode property. The value is an integer. For unassigned
codepoints, the value is zero.
bidi-class
Corresponds to the Unicode Bidi_Class
property. The value is a symbol whose name is the Unicode
directional type of the character. Emacs uses this
property when it reorders bidirectional text for display (see
Bidirectional
Display). For unassigned codepoints, the value depends on
the code blocks to which the codepoint belongs: most
unassigned codepoints get the value of L (strong
L), but some get values of AL (Arabic letter) or
R (strong R).
decomposition
Corresponds to the Unicode properties
Decomposition_Type and
Decomposition_Value. The value is a list, whose
first element may be a symbol representing a compatibility
formatting tag, such as small (see the note on tag names at the
end of this section); the other
elements are characters that give the compatibility
decomposition sequence of this character. For characters that
don’t have decomposition sequences, and for unassigned
codepoints, the value is a list with a single member, the
character itself.
decimal-digit-value
Corresponds to the Unicode Numeric_Value
property for characters whose Numeric_Type is
‘Decimal’. The value is an integer,
or nil if the character has no decimal digit
value. For unassigned codepoints, the value is
nil, which means NaN, or
“not a number”.
digit-value
Corresponds to the Unicode Numeric_Value
property for characters whose Numeric_Type is
‘Digit’. The value is an integer.
Examples of such characters include compatibility subscript
and superscript digits, for which the value is the
corresponding number. For characters that don’t have
any numeric value, and for unassigned codepoints, the value
is nil, which means NaN.
numeric-value
Corresponds to the Unicode Numeric_Value
property for characters whose Numeric_Type is
‘Numeric’. The value of this
property is a number. Examples of characters that have this
property include fractions, subscripts, superscripts, Roman
numerals, currency numerators, and encircled numbers. For
example, the value of this property for the character
U+2155 (VULGAR FRACTION ONE
FIFTH) is 0.2. For characters that
don’t have any numeric value, and for unassigned
codepoints, the value is nil, which means
NaN.
mirrored
Corresponds to the Unicode Bidi_Mirrored
property. The value of this property is a symbol, either
Y or N. For unassigned codepoints,
the value is N.
mirroring
Corresponds to the Unicode
Bidi_Mirroring_Glyph property. The value of this
property is a character whose glyph represents the mirror
image of the character’s glyph, or nil if
there’s no defined mirroring glyph. All the characters
whose mirrored property is N have
nil as their mirroring property;
however, some characters whose mirrored property
is Y also have nil for
mirroring, because no appropriate characters
exist with mirrored glyphs. Emacs uses this property to
display mirror images of characters when appropriate (see
Bidirectional
Display). For unassigned codepoints, the value is
nil.
paired-bracket
Corresponds to the Unicode
Bidi_Paired_Bracket property. The value of this
property is the codepoint of a character’s paired
bracket, or nil if the character is not a
bracket character. This establishes a mapping between
characters that are treated as bracket pairs by the Unicode
Bidirectional Algorithm; Emacs uses this property when it
decides how to reorder for display parentheses, braces, and
other similar characters (see Bidirectional
Display).
bracket-type
Corresponds to the Unicode
Bidi_Paired_Bracket_Type property. For
characters whose paired-bracket property is
non-nil, the value of this property is a symbol,
either o (for opening bracket characters) or
c (for closing bracket characters). For
characters whose paired-bracket property is
nil, the value is the symbol n
(None). Like paired-bracket, this property is
used for bidirectional display.
old-name
Corresponds to the Unicode Unicode_1_Name
property. The value is a string. For unassigned codepoints,
and characters that have no value for this property, the
value is nil.
iso-10646-comment
Corresponds to the Unicode ISO_Comment
property. The value is either a string or nil.
For unassigned codepoints, the value is nil.
uppercase
Corresponds to the Unicode
Simple_Uppercase_Mapping property. The value of
this property is a single character. For unassigned
codepoints, the value is nil, which means the
character itself.
lowercase
Corresponds to the Unicode
Simple_Lowercase_Mapping property. The value of
this property is a single character. For unassigned
codepoints, the value is nil, which means the
character itself.
titlecase
Corresponds to the Unicode
Simple_Titlecase_Mapping property. Title
case is a special form of a character used when the
first character of a word needs to be capitalized. The value
of this property is a single character. For unassigned
codepoints, the value is nil, which means the
character itself.
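For the three case-mapping properties, a nil value stands for the character itself, so callers usually want a fallback. The following is a minimal sketch of that convention; the helper name my-simple-case-char is hypothetical, and it relies on get-char-code-property, the accessor used in the examples of this section.

```elisp
;; Hypothetical helper, for illustration only: treat a nil case
;; mapping as "the character itself", as described above.
(defun my-simple-case-char (char property)
  "Return CHAR's simple case mapping for PROPERTY, or CHAR itself.
PROPERTY should be one of the symbols `uppercase', `lowercase'
or `titlecase'."
  (or (get-char-code-property char property) char))

(my-simple-case-char ?a 'uppercase)
;; ⇒ 65 ;; ?A

;; A digit has no case mapping, so the character comes back unchanged.
(my-simple-case-char ?1 'uppercase)
;; ⇒ 49 ;; ?1
```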
Function: get-char-code-property char propname
This function returns the value of char’s propname property.
(get-char-code-property ?\s 'general-category)
⇒ Zs
(get-char-code-property ?1 'general-category)
⇒ Nd
;; U+2084 SUBSCRIPT FOUR
(get-char-code-property ?\u2084 'digit-value)
⇒ 4
;; U+2155 VULGAR FRACTION ONE FIFTH
(get-char-code-property ?\u2155 'numeric-value)
⇒ 0.2
;; U+2163 ROMAN NUMERAL FOUR
(get-char-code-property ?\u2163 'numeric-value)
⇒ 4
(get-char-code-property ?\( 'paired-bracket)
⇒ 41 ;; closing parenthesis
(get-char-code-property ?\) 'bracket-type)
⇒ c
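When several properties of one character are needed at once, the calls can be collected into an alist. This is only a sketch: the helper name my-char-properties and the default selection of properties are assumptions made for the illustration.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-char-properties (char &optional properties)
  "Return an alist of (PROPERTY . VALUE) pairs for CHAR.
PROPERTIES defaults to a small, arbitrary selection."
  (mapcar (lambda (prop)
            (cons prop (get-char-code-property char prop)))
          (or properties
              '(name general-category bidi-class mirrored))))

(my-char-properties ?A)
;; ⇒ ((name . "LATIN CAPITAL LETTER A") (general-category . Lu)
;;    (bidi-class . L) (mirrored . N))
```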
Function: char-code-property-description prop value
This function returns the description string of property
prop’s value, or nil
if value has no description.
(char-code-property-description 'general-category 'Zs)
⇒ "Separator, Space"
(char-code-property-description 'general-category 'Nd)
⇒ "Number, Decimal Digit"
(char-code-property-description 'numeric-value '1/5)
⇒ nil
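The two functions combine naturally: look up a property value, then ask for its description. A minimal sketch follows; the helper name my-char-category-description is hypothetical.

```elisp
;; Hypothetical helper, for illustration only: describe the
;; general category of CHAR in words.
(defun my-char-category-description (char)
  "Return the description string of CHAR's general category, or nil."
  (char-code-property-description
   'general-category
   (get-char-code-property char 'general-category)))

(my-char-category-description ?\s)
;; ⇒ "Separator, Space"
(my-char-category-description ?1)
;; ⇒ "Number, Decimal Digit"
```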
Function: put-char-code-property char propname value
This function stores value as the value of the property propname for the character char.
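A stored value can be read back with get-char-code-property. The sketch below uses a made-up property name, my-note, to avoid overwriting any Unicode-derived data; it assumes that Emacs accepts a property name that is not one of those listed in this section.

```elisp
;; `my-note' is a hypothetical property name used only for this
;; illustration; it is not one of the properties listed above.
(put-char-code-property ?A 'my-note "first letter of the Latin alphabet")

;; Reading the property back should return the string stored above.
(get-char-code-property ?A 'my-note)
```

Storing values under the Unicode-derived property names changes what the rest of Emacs sees for that character, so it is best done deliberately.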
Variable: unicode-category-table
The value of this variable is a char-table (see Char-Tables) that
specifies, for each character, its Unicode
General_Category property as a symbol.
Variable: char-script-table
The value of this variable is a char-table that specifies, for each character, a symbol whose name is the script to which the character belongs, according to the Unicode Standard classification of the Unicode code space into script-specific blocks. This char-table has a single extra slot whose value is the list of all script symbols.
Variable: char-width-table
The value of this variable is a char-table that specifies the width of each character in columns that it will occupy on the screen.
Variable: printable-chars
The value of this variable is a char-table that specifies,
for each character, whether it is printable or not. That is,
if evaluating (aref printable-chars char)
results in t, the character is printable, and if
it results in nil, it is not.
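All four char-tables described above are read with aref. The sketch below gathers their entries for one character; it uses the variable names unicode-category-table, char-script-table, char-width-table, and printable-chars, and the helper name my-char-table-info is hypothetical.

```elisp
;; Hypothetical helper, for illustration only: collect what the
;; char-tables described above record for CHAR.
(defun my-char-table-info (char)
  "Return a plist with CHAR's category, script, width and printability."
  (list :category  (aref unicode-category-table char)
        :script    (aref char-script-table char)
        :width     (aref char-width-table char)
        :printable (aref printable-chars char)))

(my-char-table-info ?A)
;; ⇒ (:category Lu :script latin :width 1 :printable t)

;; The single extra slot of char-script-table holds the list of all
;; script symbols, so membership can be tested with memq.
(and (memq 'latin (char-table-extra-slot char-script-table 0)) t)
;; ⇒ t
```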
Note on decomposition tags: the Unicode specification writes the names of compatibility formatting tags inside ‘<..>’ brackets, but the tag names in Emacs do not include the brackets; e.g., Unicode specifies ‘<small>’ where Emacs uses ‘small’.